Data Mining - Computer Assignment 3

Mohammad Saadati - 810198410

Introduction

In 2014, a team of researchers published an article entitled Impact of HbA1c Measurement on Hospital Readmission Rates: Analysis of 70,000 Clinical Database Patient Records, based on data on diabetes patients collected from dozens of hospitals and clinics in the United States. Part of this data has been made publicly available in anonymized form, comprising roughly one hundred thousand records with fifty features.

In this project, we use clustering algorithms such as K-means and DBSCAN to cluster the given data.

Import Libraries

First of all, we import necessary libraries to use their functions.

Part 1: Preprocessing

Data Preprocessing, or Data Preparation, is a data mining technique that transforms raw data into an understandable format for ML algorithms. Real-world data is usually noisy (it contains errors, outliers, and duplicates), incomplete (some values are missing), and may be stored in different places and different formats. The task of Data Preprocessing is to handle these issues.

In the common ML pipeline, the Data Preprocessing stage sits between the Data Collection stage and Model Training / Tuning.

Pre-processing is one of the most important steps in data mining projects. Various approaches exist for handling missing data and converting data to other formats, and the careful selection of these approaches has a direct impact on the quality of the final results; therefore, the best approach should always be identified and applied.

Importance of Data Preprocessing stage

  1. Different ML models have different input data requirements (numerical data, images in a specific format, etc.). Without the right data, nothing will work.
  2. Given “bad” data, ML models will not give useful results, or may even give wrong answers that lead to wrong decisions (the GIGO principle).
  3. The higher the quality of the data, the less data is needed.

First, we load the CSV file as a DataFrame using the pandas library.

In this part, we first convert the ? values in each column to NaN, then use the two pandas functions isna and sum to count the number of missing values in each column.
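As a minimal sketch of this step (using a tiny hypothetical frame in place of the real dataset; the column names are illustrative):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the diabetes DataFrame (hypothetical values).
df = pd.DataFrame({
    "race": ["Caucasian", "?", "AfricanAmerican"],
    "weight": ["?", "?", "[75-100)"],
})

# Replace the '?' placeholder with NaN so pandas treats it as missing.
df = df.replace("?", np.nan)

# Count missing values per column: isna() marks missing cells, sum() counts them.
missing_counts = df.isna().sum()
print(missing_counts)
```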

The describe() function returns descriptive statistics for every column of the DataFrame.

The info() function returns a summary of the DataFrame, including the data type and non-null value count of each column, as well as memory usage.

In this part, we plot the distribution of features.

Histograms are the simplest way to show how data is spread.
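A sketch of how the distributions can be plotted with pandas (the columns here are hypothetical stand-ins for the dataset's features):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical numeric features standing in for the real columns.
df = pd.DataFrame({
    "time_in_hospital": rng.integers(1, 15, 200),
    "num_medications": rng.integers(1, 40, 200),
})

# DataFrame.hist draws one histogram per numeric column.
axes = df.hist(bins=14, figsize=(8, 3))
plt.tight_layout()
plt.savefig("feature_distributions.png")
```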

K-means input data requirements:

Besides the requirements above, there are a few fundamental model assumptions:

These assumptions are beyond the data preprocessing stage. There is no way to validate them before getting model results.

Stages of Data preprocessing for K-means Clustering

  1. Data Cleaning
    • Removing duplicates
    • Removing irrelevant observations and errors
    • Removing unnecessary columns
    • Handling inconsistent data
    • Handling outliers and noise
  2. Handling missing data

  3. Data Integration

  4. Data Transformation

    • Feature Construction
    • Handling skewness
    • Data Scaling
  5. Data Reduction

    • Removing dependent (highly correlated) variables
    • Feature selection
    • PCA

Data Cleaning: Removing unnecessary columns

These are columns that don’t actually fit the specific problem you’re trying to solve.

We drop columns that have many missing values.

We also drop rows that contain NaN values.
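Both operations can be sketched as follows (the 50% missing-value threshold and the toy columns are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical frame: 'weight' is mostly missing, like in the real dataset.
df = pd.DataFrame({
    "weight": [np.nan, np.nan, np.nan, "[75-100)"],
    "age": ["[50-60)", "[60-70)", np.nan, "[70-80)"],
    "gender": ["Female", "Male", "Male", "Female"],
})

# Drop columns where more than half of the values are missing.
threshold = 0.5
keep = df.columns[df.isna().mean() <= threshold]
df = df[keep]

# Then drop any remaining rows that still contain NaN.
df = df.dropna().reset_index(drop=True)
```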

Data Cleaning: Handling inconsistent data

Various situations can arise here. First, you need to observe your data. Strictly speaking, this operation belongs to the transformation step, but we do it here because, when handling missing values, it is often convenient not to have inconsistent data.

There are different ways of handling categorical data.

Label Encoders aren't a good option when there is no particular ordering among the categories. In such cases, we can use One Hot Encoding, which takes a lot of memory, since it adds a new column for each category. We use Ordinal Encoding because it gives us better features.
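A sketch of ordinal encoding with scikit-learn (the two columns are hypothetical examples of the dataset's categorical features; note that OrdinalEncoder assigns codes in sorted order of the category strings):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical categorical columns.
df = pd.DataFrame({
    "age": ["[50-60)", "[70-80)", "[60-70)"],
    "max_glu_serum": ["None", ">200", "Norm"],
})

# Each category string is mapped to an integer code (sorted lexicographically).
enc = OrdinalEncoder()
df[df.columns] = enc.fit_transform(df)
```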

Data Transformation: Data Scaling

We use Normalization to put all the features in the same range.

A min-max scaler is a way to bring the data into the range 0 to 1.
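A minimal sketch with scikit-learn's MinMaxScaler (toy data in place of the real feature matrix):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Two hypothetical features on very different scales.
X = np.array([[1.0, 200.0],
              [5.0, 400.0],
              [9.0, 600.0]])

# Each column is rescaled independently to [0, 1] via (x - min) / (max - min).
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
```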

Data Cleaning: Removing unnecessary columns


The nunique function of pandas.DataFrame counts the number of distinct elements along the specified axis and returns a Series; it can ignore NaN values.

We count the number of distinct elements in each column and then drop the columns that have only one distinct element.
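As a sketch (the column names are hypothetical; in the real dataset some drug columns contain a single value for every row):

```python
import pandas as pd

df = pd.DataFrame({
    "examide": ["No", "No", "No"],          # only one distinct value
    "insulin": ["Up", "Down", "Steady"],
})

# Columns with a single distinct value carry no information for clustering.
constant_cols = df.columns[df.nunique() == 1]
df = df.drop(columns=constant_cols)
```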

Data Cleaning: Handling (drop) outliers and noise

Note: dropping is only one of the techniques for handling outliers.

The value 3 means that 99.7% of the data is kept (the three-sigma rule); to get smoother data, you can set this value to 2 or 1.
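A sketch of the z-score filter this describes (toy data with one injected outlier; the column name is hypothetical):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame({"num_lab_procedures": rng.normal(45, 10, 1000)})
df.loc[0, "num_lab_procedures"] = 500.0   # inject an obvious outlier

# Keep rows whose z-score is within 3 standard deviations in every column.
z = (df - df.mean()) / df.std()
df_clean = df[(z.abs() < 3).all(axis=1)]
```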

You can think of this problem as basically trying to find dense areas inside a cloud with noise.

This is not the only possible solution, but you could use a clustering algorithm, specifically one that tries to find dense areas, such as DBSCAN.

Data Reduction: Removing dependent (highly correlated) variables

Data Reduction: PCA

Some algorithms such as KMeans find it difficult to accurately construct clusters if the dataset has too many features (i.e., high dimensionality). High dimensionality does not necessarily mean hundreds or even thousands of features; even 10 features can create accuracy issues.

The theory behind feature or dimensionality reduction is to convert the original feature set into fewer artificially derived features which still maintain most of the information encompassed in the original features.

One of the most prevalent feature reduction techniques is Principal Component Analysis, or PCA. PCA reduces the original dataset to a specified number of features, which PCA calls principal components, and we have to select the number of principal components we wish to keep.

Do this if you still have a lot of variables, if you can't select variables manually, or if the segmentation results aren't good.

Setting this parameter to 0.95 means that we keep 95% of the useful information; you can play with this parameter to get better results.
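A sketch of how scikit-learn's PCA uses a fractional n_components (the correlated toy matrix stands in for the scaled dataset):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Correlated toy features: the last columns are noisy copies of the first ones.
base = rng.normal(size=(200, 3))
X = np.hstack([base, base + 0.01 * rng.normal(size=(200, 3))])

# n_components=0.95 keeps the smallest number of components that together
# explain at least 95% of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
```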

Part 2: Clustering

Silhouette

The silhouette method is also a way to find the optimal number of clusters and to interpret and validate the consistency within clusters of data. It computes a silhouette coefficient for each point, measuring how similar the point is to its own cluster compared to other clusters, and provides a succinct graphical representation of how well each object has been classified.

The silhouette coefficient is a measure of how similar a data point is within-cluster (cohesion) compared to other clusters (separation).

The range of the Silhouette value is between +1 and -1. A high value is desirable and indicates that the point is placed in the correct cluster. If many points have a negative Silhouette value, it may indicate that we have created too many or too few clusters.

The equation for calculating the silhouette coefficient for a particular data point:

$$S(i) = \dfrac{b(i)-a(i)}{\max(a(i),\, b(i))} $$

where $a(i)$ is the mean distance from point $i$ to the other points in its own cluster (cohesion), and $b(i)$ is the mean distance from point $i$ to the points in the nearest neighboring cluster (separation).

We will then calculate the average_silhouette for every k.

$$ AverageSilhouette = mean(S(i)) $$

Then plot the graph between average_silhouette and K.
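The loop over k can be sketched as follows (four well-separated synthetic blobs stand in for the preprocessed data, so the expected optimum here is k = 4):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Four well-separated toy blobs in place of the real dataset.
X, _ = make_blobs(n_samples=400,
                  centers=[[0, 0], [10, 10], [0, 10], [10, 0]],
                  cluster_std=1.0, random_state=0)

# Average silhouette, mean(S(i)), for each candidate number of clusters k.
avg_silhouette = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    avg_silhouette[k] = silhouette_score(X, labels)

best_k = max(avg_silhouette, key=avg_silhouette.get)
```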

Points to remember while calculating silhouette coefficient:

There are two things to consider here:

K-means

K-means clustering is an unsupervised algorithm. In an unsupervised algorithm, we are not interested in making predictions (since we don’t have a target/output variable). The objective is to discover interesting patterns in the data, e.g., are there any subgroups or clusters among the bank's customers?

Clustering techniques use raw data to form clusters based on common factors among various data points. Customer segmentation for targeted marketing is one of the most vital applications of the clustering algorithm.

Hyperparameters are model configuration properties that define the model and remain constant during training. The design of the model can be changed by tuning the hyperparameters. For K-Means clustering, there are 3 main hyperparameters to set up to define the best configuration of the model:

The initial values of the clusters greatly impact the clustering model, and there are various algorithms to initialize them. Distance measures are used to assign points to the nearest cluster center, and different distance measures yield different clusters.

The number of clusters (k) is the most important hyperparameter in K-Means clustering. If we already know beforehand the number of clusters to group the data into, then there is no need to tune the value of k. For example, k = 10 for the MNIST digit classification dataset.

Let us look at the Python code with the help of an example.
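A minimal K-means fit, sketched on two obvious hypothetical groups standing in for the scaled features:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two clearly separated groups of points (hypothetical stand-in data).
X = np.vstack([rng.normal(0.0, 0.5, (50, 2)),
               rng.normal(5.0, 0.5, (50, 2))])

# n_clusters is the key hyperparameter; n_init restarts the algorithm from
# several centroid initializations and keeps the best result.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_
```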

We see that the silhouette score is maximized at k = 3. So, we will take 3 clusters.

NOTE: The silhouette Method is used in combination with the Elbow Method for a more confident decision.

In k-means clustering, the number of clusters you want to divide your data points into, i.e., the value of K, has to be pre-determined, whereas in hierarchical clustering the data is automatically formed into a tree shape (a dendrogram).

So how do we decide which clustering to select? We choose either of them depending on our problem statement and business requirement.

Hierarchical clustering gives you a deep insight into each step of converging different clusters and creates a dendrogram. It helps you to figure out which cluster combination makes more sense.

Silhouette analysis can be used to study the separation distance between the resulting clusters. The silhouette plot displays a measure of how close each point in one cluster is to points in the neighboring clusters and thus provides a way to assess parameters like number of clusters visually. This measure has a range of [-1, 1].

Silhouette coefficients (as these values are referred to) near +1 indicate that the sample is far away from the neighboring clusters. A value of 0 indicates that the sample is on or very close to the decision boundary between two neighboring clusters, and negative values indicate that those samples might have been assigned to the wrong cluster.

In this example, silhouette analysis is used to choose an optimal value for n_clusters. The silhouette plots show that n_clusters values of 3, 5 and 6 are bad picks for the given data, due to the presence of clusters with below-average silhouette scores and to wide fluctuations in the size of the silhouette plots. Silhouette analysis is more ambivalent in deciding between 2 and 4.

Also, the cluster sizes can be visualized from the thickness of the silhouette plots. The silhouette plot for cluster 0 when n_clusters is equal to 2 is bigger in size, owing to the grouping of the 3 sub-clusters into one big cluster. However, when n_clusters is equal to 4, all the plots are more or less of similar thickness and hence of similar sizes, as can also be verified from the labelled scatter plot on the right.

Analysis

Here is the Silhouette analysis done on the above plots to select an optimal value for n_clusters.

In all the pictures above, we can clearly see how the plot and the score differ according to n_clusters (k). So, we can easily choose the highest score and the corresponding number of clusters k via the silhouette analysis technique instead of the elbow technique.

Conclusion:

K-means clustering is one of the simplest and most popular unsupervised machine learning algorithms. We can evaluate the algorithm in two ways: the elbow technique and the silhouette technique. We saw the differences between them above. The silhouette technique gives us a more precise score and number of clusters k for the k-means algorithm; however, we can also use the elbow technique for a quick answer and intuition.


DBSCAN

Density-based spatial clustering of applications with noise (DBSCAN) is an unsupervised clustering ML algorithm. Unsupervised in the sense that it does not use pre-labeled targets to cluster the data points. Clustering in the sense that it attempts to group similar data points into artificial groups or clusters. It is an alternative to popular clustering algorithms such as KMeans and hierarchical clustering.

KMeans vs DBSCAN

KMeans is especially vulnerable to outliers. As the algorithm iterates through the centroids, outliers have a significant impact on how the centroids move before reaching stability and convergence. Furthermore, KMeans has problems accurately clustering data where the clusters are of different sizes and densities. K-Means can only produce spherical clusters, and its accuracy will suffer if the data is not spherical. Last but not least, KMeans requires us to select the number of clusters we wish to find in advance. Below is an example of how KMeans and DBSCAN would each cluster the same dataset.

On the other hand, DBSCAN does not require us to specify the number of clusters, avoids outliers, and works quite well with arbitrarily shaped and sized clusters. It does not have centroids, the clusters are formed by a process of linking neighbor points together.

How Does DBSCAN do its Magic?

First, let’s define Epsilon and Minimum Points, two required parameters when applying the DBSCAN algorithm, and some additional terminology.

  1. Epsilon (ɛ): The maximum radius of the neighborhood. Data points are valid neighbors if their mutual distance is less than or equal to the specified epsilon. In other words, it is the distance DBSCAN uses to determine whether two points are similar and belong together. A larger epsilon will produce broader clusters (encompassing more data points) and a smaller epsilon will build smaller clusters. In general, we prefer smaller values, because we only want a small fraction of data points to be within the epsilon distance of each other. Too small and you’ll begin to split valid clusters into smaller and smaller clusters. Too big and you’ll aggregate valid clusters into 2–3 massive clusters with little business insight.
  2. Minimum Points (minPts): The minimum number of data points within the radius of a neighborhood (i.e., epsilon) for the neighborhood to be considered a cluster. Keep in mind that the initial point is included in minPts. A low minPts helps the algorithm build more clusters, with more noise or outliers. A higher minPts will ensure more robust clusters, but if it’s too large, smaller clusters will be incorporated into larger ones.

Additional Terminology

Core Points: Core data points have at least minPts data points within their epsilon distance.

Border Points: Border data points are on the outskirts: they are in the neighborhood (i.e., within epsilon distance of a core point) but have fewer than the required minPts.

Outlier Points: These points are not part of a neighborhood (i.e., they are more than epsilon away) and are not border points. They are located in low-density areas.

First, a random point is selected that has at least minPts points within its epsilon radius. Then each point within the neighborhood of that core point is evaluated to determine whether it has minPts nearby within the epsilon distance (minPts includes the point itself). If the point meets the minPts criterion, it becomes another core point and the cluster expands. If it does not, it becomes a border point. As the process continues, a chain develops: core point “a” is a neighbor of “b”, which is a neighbor of “c”, and so on. The cluster is complete when it is surrounded by border points, because there are no more points within the epsilon distance. A new random point is then selected and the process repeats to identify the next cluster.
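The resulting labels can be sketched with scikit-learn's DBSCAN (toy data: one dense blob plus a far-away point; the eps and min_samples values are illustrative):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# A dense blob plus one far-away point that should end up as noise.
X = np.vstack([rng.normal(0.0, 0.3, (40, 2)), [[10.0, 10.0]]])

db = DBSCAN(eps=0.5, min_samples=5).fit(X)
# Label -1 marks outlier points; core points are listed in core_sample_indices_.
labels = db.labels_
```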

How to Determine Optimal Epsilon Value

One method used to estimate the optimal epsilon value is to use nearest neighbor distances. If you recall, k-nearest neighbors is a supervised ML classification algorithm that classifies new data points based on their distance from other “known” data points. We train a KNN model on labeled training data to determine which data points belong to which class. Then, when we apply the model to new data, the algorithm determines which class the new data point belongs to based on its distance to the trained examples. We do have to determine the “k” parameter a priori, which specifies how many closest or nearest neighboring points the model will consider before assigning the new data point to a class.

To determine the best epsilon value, we calculate the distance between each point and its k nearest neighbors. We then plot these k-distances in sorted order and choose the epsilon value at the “elbow” of the graph: the y-axis shows the distances, and the x-axis the data points of the dataset.

If epsilon is chosen much too small, a large part of the data will not be clustered, whereas with a high epsilon value clusters will merge and the majority of data points will end up in the same cluster. In general, small values of epsilon are preferable, and as a rule of thumb, only a small fraction of points should be within this distance of each other.
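The k-distance computation can be sketched as follows (synthetic blobs stand in for the data; the choice k = 4 is an assumption, often tied to minPts):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)

k = 4  # hypothetical choice, commonly set to the intended minPts
nn = NearestNeighbors(n_neighbors=k).fit(X)
distances, _ = nn.kneighbors(X)  # column 0 is each point's distance to itself

# Sort each point's distance to its k-th neighbor; the "elbow" of this
# sorted curve suggests a reasonable epsilon value.
k_distances = np.sort(distances[:, -1])
```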

How to Determine Optimal minPts

Typically, we should set minPts to be greater than or equal to the dimensionality of our dataset. That said, a common rule of thumb is to multiply the number of features by 2 to determine the minPts value.

Much like the “Elbow Method” used to determine the optimal epsilon value, the minPts heuristic isn’t correct 100% of the time.

Pros of DBSCAN

Cons of DBSCAN

We can see that, by iterating through our epsilon and minimum-points values, we obtained a wide range of cluster counts and silhouette scores. Epsilon values between 0.1 and 0.25 began to produce a manageable number of clusters. Increasing epsilon to 1.2 and above creates too few clusters to make much business sense. Furthermore, some of those clusters may be just noise (i.e., label -1), which we’ll get to in a bit.

It is also important to understand that increasing epsilon decreases the number of clusters but each cluster will also begin to encompass more outlier/noise data points. There is a certain level of diminishing returns.

It is also important to point out a frequent error you will surely encounter when running this code. Sometimes, when you have set your parameter ranges (i.e., eps_values and min_samples) too broadly, the for-loop will eventually come to a combination of eps_values and min_samples that produces only one cluster. However, the silhouette_score function requires at least two clusters to be defined. You will need to restrict your parameters to avoid this issue.
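A sketch of such a parameter sweep, with a guard against the single-cluster case (synthetic blobs and the specific eps / min_samples grids are assumptions):

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)

results = {}
for eps in [0.1, 0.25, 0.5, 1.2]:
    for min_samples in [4, 8]:
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
        # Noise points carry label -1 and do not count as a cluster.
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
        # silhouette_score requires at least 2 clusters, so guard the call.
        if n_clusters >= 2:
            results[(eps, min_samples)] = (n_clusters,
                                           silhouette_score(X, labels))
```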

Summary

DBSCAN is a density-based clustering algorithm that is often used on non-linear or non-spherical datasets. Epsilon and Minimum Points are its two required parameters. Epsilon is the radius that nearby data points need to fall within to be considered similar enough to begin a cluster. Minimum Points is the minimum number of data points that need to be inside that radius (i.e., epsilon) before they can be considered a cluster.

Save clustering result into a csv file
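A sketch of this final step with pandas (the frame and its column names are hypothetical stand-ins for the preprocessed features plus the assigned cluster labels):

```python
import pandas as pd

# Hypothetical final frame: two scaled features plus the cluster labels.
df = pd.DataFrame({
    "feature_1": [0.1, 0.9, 0.5],
    "feature_2": [0.2, 0.8, 0.4],
    "cluster": [0, 1, 0],
})

# index=False keeps the row index out of the file.
df.to_csv("clustering_result.csv", index=False)

# Read it back to confirm the file round-trips.
loaded = pd.read_csv("clustering_result.csv")
```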